Abstract


The reliability of a scaled score can be computed by use of item response theory. Estimated 
reliability can be obtained even if the item response model selected is not valid.

Reliability of scaled scores can be determined by use of item response theory (Kolen, Zeng,
& Hanson, 1996). This approach involves some risk, for the item response model employed may
well not hold. An alternative approach based on item response theory does not assume that the 
model is true (Haberman, 2007). If the model is true, then the two approaches should give very 
similar results in large samples, so that an indication of the impact of model error is provided. 

In Section 1, conventional computation of reliability of a scaled score is reviewed. In Section 2, 
the proposed method of reliability computation is considered. In Section 3, some examples of 
computations are provided for operational tests. Implications are explored in Section 4. 

Throughout the paper, it is assumed that examinee responses $X_{ij}$, $1 \le j \le q$, $1 \le i \le n$, are available, where $X_{ij}$ is the response of examinee $i$, $1 \le i \le n$, to item $j$, $1 \le j \le q$. It is assumed that $q$ is at least 2. The response vectors $\mathbf{X}_i$ with coordinates $X_{ij}$, $1 \le j \le q$, are assumed to be independent and identically distributed. Each response $X_{ij}$ is in a finite set $A_j$ of real values, where $A_j$ has at least two elements. The set of possible values of $\mathbf{X}_i$ may be denoted by $A$. In many applications, $A_j = \{0, 1\}$, $X_{ij} = 1$ if the response is correct, and $X_{ij} = 0$ otherwise. In formula scoring with $a_j$ multiple choices, one might have $A_j = \{-1/(a_j - 1), 0, 1\}$ with $X_{ij} = 1$ for a correct response, $X_{ij} = 0$ for an omitted response, and $X_{ij} = -1/(a_j - 1)$ for an incorrect response.

For a $q$-dimensional vector $\mathbf{y}$, let $\Sigma(\mathbf{y})$ be the sum of the coordinates of $\mathbf{y}$. Then the sum $S_i = \Sigma(\mathbf{X}_i)$ of the $X_{ij}$, $1 \le j \le q$, is the raw score for examinee $i$. The finite set of possible raw scores is denoted by $\Sigma(A)$. To any possible raw score $s$ in $\Sigma(A)$ corresponds a real scaled score $U(s)$.

A one-dimensional item-response model is considered for the responses $\mathbf{X}_i$, $1 \le i \le n$. Under the model, it is assumed that for some finite-dimensional vector $\beta$ in a set $B$ with a nonempty interior, for some family of probability distributions $P(\gamma)$, $\gamma$ in $B$, and for any $q$-dimensional vector $\mathbf{x}$ in $A$ with coordinates $x_j$, $1 \le j \le q$, the probability $p(\mathbf{x})$ that $\mathbf{X}_i = \mathbf{x}$ is the expected value $p^*(\mathbf{x}; \beta)$ of

$$p^*(\mathbf{x} \mid \theta; \beta) = \prod_{j=1}^{q} p_j(x_j \mid \theta; \beta),$$

where $p_j(x_j \mid \theta; \beta) > 0$, the sum of the $p_j(x \mid \theta; \beta)$, $x$ in $A_j$, is equal to 1, and $\theta$ has a probability distribution $P(\beta)$. Thus for a random variable $\theta_i$, the $X_{ij}$, $1 \le j \le q$, are conditionally independent given $\theta_i$, $\theta_i$ has distribution $P(\beta)$, and the conditional probability that $X_{ij} = x_j$ given $\theta_i = \theta$ is $p_j(x_j \mid \theta; \beta)$. If defined, the expectation of a real function $g(\theta_i)$ of $\theta_i$ is denoted by $E(g(\theta); \beta)$. It is assumed that $p_j(x_j \mid \theta; \beta)$ is continuous in both $\theta$ and $\beta$. The probability $p_S(s)$, $s$ in $\Sigma(A)$, that the raw score $S_i = s$ is the sum of the probabilities $p(\mathbf{x})$ for response combinations $\mathbf{x}$ in $A$ such that the sum $\Sigma(\mathbf{x}) = s$. Thus the model assumes that the probability $p_S(s) = p_S^*(s; \beta)$, the corresponding sum of the probabilities $p^*(\mathbf{x}; \beta)$ for response combinations $\mathbf{x}$ in $A$ such that the sum $\Sigma(\mathbf{x}) = s$. Computation of $p_S^*(s; \beta)$ is not generally difficult if appropriate recursive algorithms are used (Kolen et al., 1996).
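
The recursive computation of the raw score distribution at a fixed ability value can be sketched as follows. This is a minimal illustration and not necessarily the algorithm of Kolen et al. (1996) as published: it assumes integer item scores 0, 1, ... (so formula scores are not covered), and the two-parameter logistic function `two_pl` and its parameter values are hypothetical. Averaging the result over the ability distribution would then give the marginal raw score distribution.

```python
import numpy as np

def raw_score_distribution(item_probs):
    """Distribution of the raw score at one fixed ability value.

    item_probs[j][k] is the conditional probability that item j receives
    integer score k (k = 0, 1, ...). Returns dist with
    dist[s] = P(raw score = s | ability).
    """
    dist = np.array([1.0])  # with no items, the raw score is 0 with probability 1
    for probs in item_probs:
        # Items are conditionally independent, so adding one item
        # convolves the current score distribution with that item's scores.
        dist = np.convolve(dist, np.asarray(probs, dtype=float))
    return dist

def two_pl(theta, a, b):
    """Hypothetical 2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Three illustrative items; the parameter values are made up.
a = np.array([1.0, 1.5, 0.8])
b = np.array([-0.5, 0.0, 0.7])
p = two_pl(0.0, a, b)
dist = raw_score_distribution([[1.0 - pj, pj] for pj in p])
```

The recursion costs on the order of the number of items times the number of possible raw scores, rather than requiring a sum over all response patterns.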

Even if the model does not hold, it is assumed that a unique $\beta^*$ in $B$ exists that minimizes the information measure $E(-\log p^*(\mathbf{X}_i; \gamma))$ over $\gamma$ in $B$ (Haberman, 2007). If the model does hold, then $\beta^* = \beta$. The parameter $\beta^*$ can be regarded as the value in $B$ for which the probabilities $p^*(\mathbf{x}; \beta^*)$ best approximate the probabilities $p(\mathbf{x})$ for $\mathbf{x}$ in $A$.

In practice, $\beta^*$ is estimated from the $\mathbf{X}_i$ by maximum likelihood. The maximum-likelihood estimate of $\beta^*$ is denoted by $\hat{\beta}$. It is assumed that the customary results hold that $\hat{\beta}$ converges to $\beta^*$ with probability 1 and $P(\hat{\beta})$ converges weakly to $P(\beta^*)$ with probability 1. It then follows that $E(g(\theta); \hat{\beta})$ converges with probability 1 to $E(g(\theta); \beta^*)$ if $g$ is bounded and continuous.

1 Computation of Reliability of the Scaled Score 

If the model holds, then the scaled score $U_i = U(S_i)$ has conditional variance given $\theta_i = \theta$ (conditional variance of measurement at $\theta$) of

$$\sigma^2(U \mid \theta; \beta) = \sum_{s \in \Sigma(A)} [U(s) - \mu(U \mid \theta; \beta)]^2 \, p_S^*(s \mid \theta; \beta),$$

where $p_S^*(s \mid \theta; \beta)$ is the conditional probability under the model that $S_i = s$ given $\theta_i = \theta$ and the conditional scale score mean given $\theta_i = \theta$ is

$$\mu(U \mid \theta; \beta) = \sum_{s \in \Sigma(A)} U(s) \, p_S^*(s \mid \theta; \beta).$$

The expected conditional variance given $\theta_i = \theta$ (variance of measurement) is then $E(\sigma^2(U \mid \theta; \beta); \beta)$. The conditional standard error of measurement is the square root of the conditional variance of measurement. If the expected value of $\mu(U \mid \theta_i; \beta)$ is denoted by

$$\mu(U; \beta) = E(\mu(U \mid \theta; \beta); \beta),$$

then the variance of $\mu(U \mid \theta_i; \beta)$ is

$$\sigma^2(\mu(U \mid \theta; \beta); \beta) = E([\mu(U \mid \theta; \beta) - \mu(U; \beta)]^2; \beta).$$

Thus the reliability of the scaled score is

$$\rho^2(U; \beta) = \frac{\sigma^2(\mu(U \mid \theta; \beta); \beta)}{E(\sigma^2(U \mid \theta; \beta); \beta) + \sigma^2(\mu(U \mid \theta; \beta); \beta)}$$

(Kolen et al., 1996). Because the model is assumed to hold, $\mu(U; \beta)$ is the expected scaled score $E(U_i)$, and the variance $\sigma^2(U_i)$ of the scaled score is $E(\sigma^2(U \mid \theta; \beta); \beta) + \sigma^2(\mu(U \mid \theta; \beta); \beta)$. The maximum-likelihood estimate $E(\sigma^2(U \mid \theta; \hat{\beta}); \hat{\beta})$ of the variance of measurement converges with probability 1 to the variance of measurement $E(\sigma^2(U \mid \theta; \beta); \beta)$. The maximum-likelihood estimate $\rho^2(U; \hat{\beta})$ of the reliability converges to the actual reliability $\rho^2(U; \beta)$ with probability 1. If $P(\beta)$ is the distribution of a polytomous random variable, then the required expectations are readily computed. If $P(\beta)$ is the standard normal distribution, as is quite common in item response theory, then Gauss-Hermite quadrature may normally be employed to find expectations.
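
The model-based computation can be sketched in a few lines under simplifying assumptions that are not part of the report: all items are dichotomous and follow a 2PL model with hypothetical parameters `a` and `b`, the ability distribution is standard normal, and the function name and the number of quadrature nodes are illustrative choices. The sketch evaluates the conditional mean and variance of the scaled score at each Gauss-Hermite node and then combines them into the reliability ratio.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def model_based_reliability(a, b, scale, n_nodes=41):
    """Reliability of the scaled score when the 2PL model is assumed to hold.

    a, b  : hypothetical item discriminations and difficulties.
    scale : scale[s] is the scaled score for raw score s = 0, ..., q.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    scale = np.asarray(scale, dtype=float)
    nodes, weights = hermegauss(n_nodes)   # weight function exp(-x**2 / 2)
    weights = weights / weights.sum()      # rescale to standard normal probabilities
    cond_mean = np.empty(n_nodes)
    cond_var = np.empty(n_nodes)
    for k, theta in enumerate(nodes):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        dist = np.array([1.0])
        for pj in p:                       # raw score distribution at this node
            dist = np.convolve(dist, [1.0 - pj, pj])
        cond_mean[k] = dist @ scale
        cond_var[k] = dist @ (scale - cond_mean[k]) ** 2
    error_var = weights @ cond_var         # variance of measurement
    mean = weights @ cond_mean             # expected scaled score
    true_var = weights @ (cond_mean - mean) ** 2
    return true_var / (true_var + error_var)

# Illustrative call: five items, identity raw-to-scale conversion.
r5 = model_based_reliability(np.ones(5), np.linspace(-1.0, 1.0, 5), np.arange(6))
```

Replacing `scale` with a nonlinear or rounded conversion table changes only the last argument, which is the sense in which the method applies to scaled scores and not only to raw scores.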

At ETS, computation of reliability of a scaled score is accomplished by a much older and 
much cruder approach (Dorans, 1984) based on local linear approximations of scaled scores by raw 
scores and based on approximation of the distribution of true scores by the empirical distribution 
of raw scores. 


2 Alternative Computation of Reliability of the Scaled Score 

An alternative approach to computation of the reliability of the scaled score does not assume that the model is correct. In this approach (Haberman, 2007), a random variable $\theta_i^*$ has distribution $P(\beta^*)$, and a random vector $\mathbf{X}_i^*$ with the same possible values as $\mathbf{X}_i$ has conditional probability $p^*(\mathbf{x} \mid \theta; \beta^*)$ that $\mathbf{X}_i^* = \mathbf{x}$ given $\theta_i^* = \theta$. A random variable $\theta_i$ then exists such that the conditional distribution of $\theta_i$ given $\mathbf{X}_i$ is the same as the conditional distribution of $\theta_i^*$ given $\mathbf{X}_i^*$. No assumptions are made concerning the distribution of $\mathbf{X}_i$ other than that each probability $p(\mathbf{x})$ is positive. Let $p$ denote the function on $A$ with value $p(\mathbf{x})$ at $\mathbf{x}$ in $A$. If $g$ is a real function on the real line and if the expected value $E(g(\theta); \beta^*)$ of $g(\theta_i^*)$ is defined, then the expected value of $g(\theta_i)$ is the expected value

$$E^*(g(\theta); p, \beta^*) = E(c(\theta; p, \beta^*) \, g(\theta); \beta^*)$$

of the product $c(\theta_i^*; p, \beta^*) \, g(\theta_i^*)$. Here the multiplier

$$c(\theta; p, \beta^*) = \sum_{\mathbf{x} \in A} d(\theta; \mathbf{x}, p, \beta^*),$$

and the summand

$$d(\theta; \mathbf{x}, p, \beta^*) = \frac{p(\mathbf{x}) \, p^*(\mathbf{x} \mid \theta; \beta^*)}{p^*(\mathbf{x}; \beta^*)}.$$

The conditional probability that $\mathbf{X}_i = \mathbf{x}$ given $\theta_i = \theta$ is then $d(\theta; \mathbf{x}, p, \beta^*) / c(\theta; p, \beta^*)$. If the model is valid, then the summand $d(\theta; \mathbf{x}, p, \beta^*)$ is the conditional probability $p^*(\mathbf{x} \mid \theta; \beta)$ that $\mathbf{X}_i = \mathbf{x}$ given $\theta_i = \theta$, and the multiplier $c(\theta; p, \beta^*)$ is 1. Thus the expected value $E^*(g(\theta); p, \beta^*)$ of $g(\theta_i)$ is the same as the expected value of $g(\theta_i^*)$.
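
The role of the multiplier can be checked directly. As a sketch, assume for illustration that the ability distribution has a density $f$ (an assumption made here only to simplify notation; the same steps apply with a probability function for a discrete distribution). The construction of $\theta_i$ then gives the following.

```latex
\begin{align*}
\text{joint density of } (\mathbf{X}_i, \theta_i) \text{ at } (\mathbf{x}, \theta):\quad
  & p(\mathbf{x}) \, \frac{p^*(\mathbf{x} \mid \theta; \beta^*) \, f(\theta)}{p^*(\mathbf{x}; \beta^*)}
    = d(\theta; \mathbf{x}, p, \beta^*) \, f(\theta), \\
\text{marginal density of } \theta_i:\quad
  & \sum_{\mathbf{x} \in A} d(\theta; \mathbf{x}, p, \beta^*) \, f(\theta)
    = c(\theta; p, \beta^*) \, f(\theta), \\
\text{total mass:}\quad
  & \int c(\theta; p, \beta^*) \, f(\theta) \, d\theta
    = \sum_{\mathbf{x} \in A} p(\mathbf{x}) = 1, \\
\text{if } p(\mathbf{x}) = p^*(\mathbf{x}; \beta^*):\quad
  & c(\theta; p, \beta^*) = \sum_{\mathbf{x} \in A} p^*(\mathbf{x} \mid \theta; \beta^*) = 1 .
\end{align*}
```

The multiplier therefore acts as a density ratio that reweights the model ability distribution so that it is consistent with the observed distribution of response patterns.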

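The scaled score $U_i$ has conditional variance given $\theta_i = \theta$ (conditional variance of measurement at $\theta$) of

$$\sigma_*^2(U \mid \theta; p, \beta^*) = [c(\theta; p, \beta^*)]^{-1} \sum_{\mathbf{x} \in A} [U(\Sigma(\mathbf{x})) - \mu^*(U \mid \theta; p, \beta^*)]^2 \, d(\theta; \mathbf{x}, p, \beta^*),$$

where the conditional scale score mean given $\theta_i = \theta$ is

$$\mu^*(U \mid \theta; p, \beta^*) = [c(\theta; p, \beta^*)]^{-1} \sum_{\mathbf{x} \in A} U(\Sigma(\mathbf{x})) \, d(\theta; \mathbf{x}, p, \beta^*).$$

The expected conditional variance given $\theta_i = \theta$ (variance of measurement) is then $E^*(\sigma_*^2(U \mid \theta; p, \beta^*); p, \beta^*)$. The square root of the conditional variance of measurement is the conditional standard error of measurement. The expected value of $\mu^*(U \mid \theta_i; p, \beta^*)$ is the expected value

$$E(U; p) = \sum_{\mathbf{x} \in A} U(\Sigma(\mathbf{x})) \, p(\mathbf{x})$$

of $U_i$, so that the variance of $\mu^*(U \mid \theta_i; p, \beta^*)$ is

$$\sigma_*^2(\mu^*(U \mid \theta; p, \beta^*); p, \beta^*) = E^*([\mu^*(U \mid \theta; p, \beta^*) - E(U; p)]^2; p, \beta^*).$$

Let

$$\sigma^2(U; p) = \sum_{\mathbf{x} \in A} [U(\Sigma(\mathbf{x})) - E(U; p)]^2 \, p(\mathbf{x})$$

denote the variance of $U_i$. Based on the new variable $\theta_i$, the reliability of the scaled score is

$$\rho_*^2(U; p, \beta^*) = \frac{\sigma_*^2(\mu^*(U \mid \theta; p, \beta^*); p, \beta^*)}{\sigma^2(U; p)}.$$

If $\hat{p}(\mathbf{x})$ is the fraction of examinees $i$, $1 \le i \le n$, with $\mathbf{X}_i = \mathbf{x}$ and if $\hat{p}$ is the function on $A$ with value $\hat{p}(\mathbf{x})$ for $\mathbf{x}$ in $A$, then the estimated variance $\sigma^2(U; \hat{p})$ converges with probability 1 to the variance $\sigma^2(U; p)$ of $U_i$. The estimated variance of measurement $E^*(\sigma_*^2(U \mid \theta; \hat{p}, \hat{\beta}); \hat{p}, \hat{\beta})$ converges with probability 1 to the variance of measurement $E^*(\sigma_*^2(U \mid \theta; p, \beta^*); p, \beta^*)$. It follows that the estimated reliability $\rho_*^2(U; \hat{p}, \hat{\beta})$ of the scaled score converges to the actual reliability $\rho_*^2(U; p, \beta^*)$ with probability 1. If the model holds, then $\rho_*^2(U; \hat{p}, \hat{\beta})$ and $\rho^2(U; \hat{\beta})$ both converge to the same value.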

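In almost any real test, the vast preponderance of the $\hat{p}(\mathbf{x})$ are 0, for the number of possible combinations of responses is far larger than the sample size. For any real function $h$ on $A$, the average $\sum_{\mathbf{x} \in A} h(\mathbf{x}) \hat{p}(\mathbf{x})$ is equal to $n^{-1} \sum_{i=1}^{n} h(\mathbf{X}_i)$. In evaluation of variances and reliability coefficients, it is helpful to observe that

$$c(\theta; \hat{p}, \hat{\beta}) = n^{-1} \sum_{i=1}^{n} \frac{p^*(\mathbf{X}_i \mid \theta; \hat{\beta})}{p^*(\mathbf{X}_i; \hat{\beta})},$$

$$\sigma_*^2(U \mid \theta; \hat{p}, \hat{\beta}) = [c(\theta; \hat{p}, \hat{\beta})]^{-1} n^{-1} \sum_{i=1}^{n} [U_i - \mu^*(U \mid \theta; \hat{p}, \hat{\beta})]^2 \, \frac{p^*(\mathbf{X}_i \mid \theta; \hat{\beta})}{p^*(\mathbf{X}_i; \hat{\beta})},$$

$$\mu^*(U \mid \theta; \hat{p}, \hat{\beta}) = [c(\theta; \hat{p}, \hat{\beta})]^{-1} n^{-1} \sum_{i=1}^{n} U_i \, \frac{p^*(\mathbf{X}_i \mid \theta; \hat{\beta})}{p^*(\mathbf{X}_i; \hat{\beta})},$$

$$E(U; \hat{p}) = n^{-1} \sum_{i=1}^{n} U_i,$$

and

$$\sigma^2(U; \hat{p}) = n^{-1} \sum_{i=1}^{n} [U_i - E(U; \hat{p})]^2.$$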
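
These sample formulas translate almost line by line into array operations. The sketch below is an illustration under assumptions that go beyond the report: items are dichotomous 2PL items, the item parameters `a` and `b` are taken as already estimated, the ability distribution is standard normal and is replaced by Gauss-Hermite nodes and weights, and all function and variable names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def model_free_reliability(X, a, b, scale, n_nodes=41):
    """Scaled-score reliability without assuming that the 2PL model holds.

    X     : (n, q) array of 0/1 responses.
    a, b  : item parameters treated as given (in practice, ML estimates).
    scale : scale[s] is the scaled score for raw score s = 0, ..., q.
    """
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    nodes, w = hermegauss(n_nodes)
    w = w / w.sum()                                # standard normal probabilities
    z = a[None, :] * (nodes[:, None] - b[None, :])  # (nodes, items)
    log_p = -np.logaddexp(0.0, -z)                 # log P(correct | node)
    log_q = -np.logaddexp(0.0, z)                  # log P(incorrect | node)
    lik = np.exp(X @ log_p.T + (1.0 - X) @ log_q.T)  # pattern likelihood at each node
    ratio = lik / (lik @ w)[:, None]               # conditional over marginal likelihood
    U = np.asarray(scale, dtype=float)[X.sum(axis=1).astype(int)]
    c = ratio.mean(axis=0)                         # multiplier at each node
    mu = (U[:, None] * ratio).mean(axis=0) / c     # conditional mean at each node
    cond_var = ((U[:, None] - mu[None, :]) ** 2 * ratio).mean(axis=0) / c
    error_var = np.sum(w * c * cond_var)           # variance of measurement
    true_var = np.sum(w * c * (mu - U.mean()) ** 2)
    total_var = U.var()                            # divisor n, as in the text
    return {"reliability": true_var / total_var, "true_var": true_var,
            "error_var": error_var, "total_var": total_var}

# Illustrative use with simulated data (all values are made up).
rng = np.random.default_rng(0)
a = np.array([1.0, 1.2, 0.8, 1.5, 1.0, 0.9, 1.1, 1.3])
b = np.linspace(-1.5, 1.5, 8)
theta = rng.standard_normal(400)
X = (rng.random((400, 8)) < 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))).astype(int)
out = model_free_reliability(X, a, b, np.arange(9))
```

Because the sums run over examinees rather than over all response patterns, the cost grows with the sample size and not with the number of possible patterns. In this discretized form the estimated variance of measurement and the variance of the conditional means add up to the sample variance of the scaled scores.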


3 Examples 

To illustrate results, reliability computations were made for the scaled scores reported for 
two forms associated with an ETS assessment. The first form, Form 1, involved about 2,700 
examinees and the second form, Form 2, involved about 3,000 examinees. For each form, two 
sections, Section A and Section B, are considered. For each section, the responses of the examinee 
for that section are used to construct a raw score total for the section, and the raw score is then 
converted to a reported scale score for the section. Two approaches were considered. In the first 
approach, a two-parameter logistic (2PL) model was employed for dichotomous responses with 
possible values 0 or 1, and a generalized partial credit model was used for polytomous responses. 
Because both Section A and Section B involved item sets, a second approach was considered in 
which a generalized partial credit model was applied to the raw score subtotals for each item 
set. In addition to the IRT analysis, Cronbach $\alpha$ statistics were computed for each total raw
score for each section. Two methods were used to compute the Cronbach $\alpha$. The first method
based computations on individual item responses. The second method based computations on
raw subtotals for each set of items. Results are provided in Tables 1, 2, and 3. For comparison,
note that output from statistical analysis performed during equating of Form 2 yielded reliability
estimates for Section A of 0.871 for the scaled score and 0.888 for the raw score. For Section B
of Form 2, the estimated reliability was 0.870 for the scaled score and 0.872 for the raw score.
As previously indicated, these computations are based on Dorans (1984).
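
The two ways of computing the Cronbach coefficient differ only in what is treated as a part of the test. A minimal sketch, with hypothetical function names and a hypothetical `set_ids` labeling of items by item set, is:

```python
import numpy as np

def cronbach_alpha(parts):
    """Cronbach's alpha for an (examinees x parts) matrix of part scores."""
    parts = np.asarray(parts, dtype=float)
    k = parts.shape[1]
    sum_part_vars = parts.var(axis=0, ddof=1).sum()
    total_var = parts.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_part_vars / total_var)

def set_subtotals(items, set_ids):
    """Raw subtotal for each item set; set_ids[j] is the set label of item j."""
    items = np.asarray(items, dtype=float)
    set_ids = np.asarray(set_ids)
    return np.column_stack([items[:, set_ids == g].sum(axis=1)
                            for g in np.unique(set_ids)])

# Item-based and set-based coefficients for the same made-up responses.
items = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 0, 0, 1],
                  [1, 1, 1, 1],
                  [0, 0, 0, 0]])
alpha_items = cronbach_alpha(items)
alpha_sets = cronbach_alpha(set_subtotals(items, [0, 0, 1, 1]))
```

Both calls use the same total raw score; only the partition into parts changes, so any difference between the two coefficients reflects dependence among items within a set.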


Table 1

Reliability Estimates

Estimate            Score   Basis   Form 1      Form 1      Form 2      Form 2
                                    Section A   Section B   Section A   Section B
Model assumed       Scale   Items   0.907       0.896       0.892       0.876
Model not assumed   Scale   Items   0.904       0.896       0.874       0.862
Model assumed       Raw     Items   0.909       0.898       0.892       0.882
Model not assumed   Raw     Items   0.905       0.898       0.887       0.871
Cronbach $\alpha$   Raw     Items   0.893       0.892       0.873       0.863
Model assumed       Scale   Sets    0.880       0.877       0.863       0.858
Model not assumed   Scale   Sets    0.880       0.879       0.859       0.849
Model assumed       Raw     Sets    0.882       0.879       0.869       0.865
Model not assumed   Raw     Sets    0.881       0.881       0.869       0.857
Cronbach $\alpha$   Raw     Sets    0.878       0.870       0.865       0.839


Table 2

Standard Errors of Measurement

Estimate            Score   Basis   Form 1      Form 1      Form 2      Form 2
                                    Section A   Section B   Section A   Section B
Model assumed       Scale   Items   2.156       2.239       2.191       2.179
Model not assumed   Scale   Items   2.165       2.241       2.214       2.207
Model assumed       Raw     Items   2.763       2.349       2.874       2.041
Model not assumed   Raw     Items   2.771       2.344       2.881       2.040
Cronbach $\alpha$   Raw     Items   2.944       2.413       3.049       2.108
Model assumed       Scale   Sets    2.423       2.416       2.338       2.287
Model not assumed   Scale   Sets    2.420       2.415       2.336       2.310
Model assumed       Raw     Sets    3.105       2.539       3.097       2.144
Model not assumed   Raw     Sets    3.100       2.531       3.092       2.147
Cronbach $\alpha$   Raw     Sets    3.142       2.653       3.142       2.285


Table 3

Standard Deviations

Estimate            Score   Basis   Form 1      Form 1      Form 2      Form 2
                                    Section A   Section B   Section A   Section B
Model assumed       Scale   Items   7.070       6.929       6.659       6.179
Model not assumed   Scale   Items   6.985       6.950       6.229       5.937
Model assumed       Raw     Items   9.145       7.352       8.752       5.951
Model not assumed   Raw     Items   9.003       7.349       8.554       5.686
Cronbach $\alpha$   Raw     Items   9.003       7.349       8.554       5.686
Model assumed       Scale   Sets    7.007       6.894       6.613       6.072
Model not assumed   Scale   Sets    6.985       6.950       6.229       5.937
Model assumed       Raw     Sets    9.037       7.302       8.546       5.829
Model not assumed   Raw     Sets    9.003       7.349       8.554       5.686
Cronbach $\alpha$   Raw     Sets    9.003       7.349       8.554       5.686


The various estimates are not dramatically different, but differences are still notable. The 
key issue appears to be the treatment of item sets. Given the same treatment of sets, estimated 
standard errors of measurement are quite similar for both IRT approaches. As should be
expected, the Cronbach $\alpha$ statistics give larger estimated standard errors of measurement than
do the corresponding IRT procedures. The IRT estimates are from 1 to 6% smaller. The issue
of item sets is somewhat more notable. Set-based estimates are from 3 to 12% larger than are 
the corresponding item-based estimates. The estimated standard deviations of scores for the
method of Section 1, in which the model is assumed to hold, are often quite close to those for the
method of Section 2, in which the model is not assumed to hold, but differences can exist. For
example, consider the scaled score for Section B of Form 2 for the item-based estimate. The 
estimate that assumes model validity is about 7% larger than is the estimate that does not assume 
model validity. The reliability estimates are rather similar for different methods that provide the 
same treatment or lack of treatment of item sets. Conditional on method of estimate, including 
treatment of item sets, the reliability results for scale scores and raw scores are rather similar 
despite some nonlinearity of the raw-to-scale conversion in Form 2 and despite use of rounded scale 
scores. Effects of item sets are of some concern, especially if one considers percentage changes in 
terms of differences from 1. For example, in the case of the Cronbach $\alpha$, for Section B of Form 2,
the item-based result is about 15% closer to 1 than is the set-based result.
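
Two of these comparisons can be reproduced with simple arithmetic from the table entries. The sketch below assumes the usual relation in which the standard error of measurement equals the standard deviation times the square root of one minus the reliability; the report does not state this relation explicitly, but the tabled values are consistent with it up to rounding.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from a standard deviation and a reliability."""
    return sd * math.sqrt(1.0 - reliability)

# Form 1, Section A, scale score, item basis, model assumed:
# standard deviation 7.070 (Table 3) and reliability 0.907 (Table 1).
sem_check = sem(7.070, 0.907)   # Table 2 reports 2.156

# Cronbach alpha, Form 2, Section B: 0.863 for items and 0.839 for sets.
# Relative reduction in the distance from 1 when items replace sets.
closer_to_one = 1.0 - (1.0 - 0.863) / (1.0 - 0.839)
```

Expressing the change relative to the distance from 1 makes a difference of 0.024 in the coefficient appear as roughly a 15% change, which is why set effects look more important on that scale.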

4 Conclusions


It is quite feasible to estimate reliability of scaled scores with far fewer approximations than 
are currently used in ETS estimation procedures. At least for the cases examined, the effect of 
more accurate estimation is not dramatic but not negligible. 

The approach to item sets in the analysis is not necessarily the best one. Especially in 
Section A, in which item sets are quite large, it may be more appropriate to apply restrictive 
models on the parameters in the generalized partial credit model to overcome concerns about 
the small numbers of examinees with certain extreme raw subtotals. In addition, it is possible 
to consider testlet models for the treatment of item sets. The latter choice was avoided in this 
investigation due to the somewhat higher computational labor involved. Nonetheless, the role of 
testlet models does warrant study. 

The approach that does not assume model validity is not a perfect solution to invalid models,
although comparison of the results that assume validity with those that do not can indicate
model deficiencies. Nonetheless, analysis of item-based models was still not entirely successful at
revealing set effects. Thus it is not realistic to expect that analysis that does not assume a valid
model will inevitably lead to a satisfactory treatment of reliability.